Dynamics of Learning with Restricted Training Sets 

II. Tests and Applications 



A.C.C. Coolen
Department of Mathematics, King's College London
Strand, London WC2R 2LS, UK

D. Saad
The Neural Computing Research Group, Aston University
Birmingham B4 7ET, UK

September 28th 1999 



Abstract 

We apply a general theory describing the dynamics of supervised learning in layered neural networks 
in the regime where the size p of the training set is proportional to the number of inputs N, as 
developed in a previous paper, to several choices of learning rules. In the case of (on-line and batch) 
Hebbian learning, where a direct exact solution is possible, we show that our theory provides exact 
results at any time in many different verifiable cases. For non-Hebbian learning rules, such as 
Perceptron and AdaTron, we find very good agreement between the predictions of our theory and 
numerical simulations. Finally, we derive three approximation schemes aimed at eliminating the 
need to solve a functional saddle-point equation at each time step, and assess their performance. 
The simplest of these schemes leads to a fully explicit and relatively simple non-linear diffusion 
equation for the joint field distribution, which already describes the learning dynamics surprisingly 
well over a wide range of parameters. 

1 Introduction



In a previous paper [1] we applied the formalism of dynamical replica theory [2] to analyse the
dynamics of supervised learning in perceptrons with restricted training sets. For an introduction to
the dynamics of learning in layered neural networks, a guide to the relevant references, and a proper
discussion of the peculiarities of the dynamics of learning with restricted as opposed to infinite
training sets, we refer to [1]. The microscopic variables in the learning process are the
components of the weight vector $\mathbf{J}$ of the 'student' network. The 'teacher' network, which defines
the task to be learned by the student, is characterised by a weight vector $\mathbf{B}$. Learning proceeds on
the basis of answers to be given to questions $\boldsymbol{\xi} \in D \subseteq \{-1,1\}^N$, according to a dynamical rule for
$\mathbf{J}$ which is defined in terms of a function $\mathcal{G}[x,y]$, where $x = \mathbf{J}\cdot\boldsymbol{\xi}$ and $y = \mathbf{B}\cdot\boldsymbol{\xi}$ are the student and
teacher fields, respectively. The randomly composed set $D$ of questions, of size $p = \alpha N$, is called
the training set. If $\alpha < \infty$ as $N \to \infty$ the learning dynamics will be nontrivial. Firstly, the data in
$D$ will be recycled in due course, which generates complicated correlations and non-Gaussian local
field distributions, and allows the system to improve its performance partly by memorizing answers
rather than by learning the underlying rule (hence the difference between training and generalization
errors). Secondly, the actual composition of the randomly drawn training set introduces an element
of frozen disorder into the problem, which will have to be averaged out. The analysis described in
[1] resulted in a general macroscopic theory, describing the behaviour of such learning processes with
$\alpha < \infty$ in the limit $N \to \infty$ (infinite systems) and on finite time-scales, in terms of deterministic laws
for macroscopic observables. The theory applies to both on-line learning (where weight updates are
made after each presentation of an input vector from the training set) and batch learning (where
weight updates are themselves averages over the full training set), as well as to arbitrary learning rules
(i.e. arbitrary functions $\mathcal{G}[x,y]$).

In this paper we apply this general theory to several specific choices of learning rules. One
of these, (on-line and batch) Hebbian learning, provides an excellent benchmark test for our theory,
since for this simple rule exact solutions are known, even for the regime of restricted training sets [3].
We find that our theory is fully exact for batch execution, and that it succeeds in predicting exactly the
evolution of several macroscopic observables, including the generalisation error and moments of the
joint field distribution for student and teacher fields, in the on-line case (although here full exactness
is difficult to assess, and not a priori guaranteed). A preliminary presentation of some of the results
in this paper (those involving Hebbian learning) was given in [4]. For non-Hebbian error-correcting
learning rules, such as on-line and batch versions of Perceptron learning and AdaTron learning, no
exact solutions are known at present with which to confront our theory; instead we compare
the predictions (for the evolution of training and generalization errors and of the joint field
distribution) of the full theory, as well as of a number of simple approximations of our equations,
with the results of extensive numerical simulations in large (size $N = 10,000$) neural
networks. We find, surprisingly, that even the simplest of these approximations, which does not
require solving any saddle-point equations and takes the form of a fully explicit non-linear diffusion
equation for the joint field distribution $P[x,y]$, describes the simulation experiments remarkably
well. Employing the more sophisticated (and thereby more CPU intensive) approximations, or, at the
other end of the spectrum, a numerical solution of the full macroscopic theory, leads to increasingly
accurate quantitative predictions for the evolution of the relevant macroscopic observables of the
learning process, with deviations between theory and numerical experiment which are of the order of
magnitude of the finite size effects in the simulations. We close our paper with a discussion of the
strengths and weaknesses of the approach used, and an outlook on future work on the dynamics of
learning with restricted training sets, involving the present and possibly other formalisms.



2 Summary of the Theory and its Properties 



In this section we will be brief in order to avoid inappropriate duplication of the material in [1], to
which we refer for full details.



2.1 The Macroscopic Laws 

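Our macroscopic observables are $Q = \mathbf{J}^2$, $R = \mathbf{B}\cdot\mathbf{J}$, and the joint distribution of student and teacher
fields (or 'activations') $P[x,y] = \langle\delta[x-\mathbf{J}\cdot\boldsymbol{\xi}]\,\delta[y-\mathbf{B}\cdot\boldsymbol{\xi}]\rangle_D$, where the average is taken over the
questions in the training set $D$. For $N \to \infty$ all these quantities are found
to obey deterministic and self-averaging equations. We define $\langle f[x,y]\rangle = \int dx\,Dy\,P[x|y]f[x,y]$, where
$Dy = (2\pi)^{-1/2}e^{-\frac{1}{2}y^2}dy$, and the following averages (the function $\Phi[x,y]$ will be specified below):

$$U = \langle\Phi[x,y]\,\mathcal{G}[x,y]\rangle, \qquad V = \langle x\,\mathcal{G}[x,y]\rangle, \qquad W = \langle y\,\mathcal{G}[x,y]\rangle, \qquad Z = \langle\mathcal{G}^2[x,y]\rangle \qquad (1)$$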



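For on-line learning (which is a stochastic process) we have found, using the prescriptions of dynamical
replica theory (in replica symmetric ansatz) [1]:

$$\frac{d}{dt}Q = 2\eta V + \eta^2 Z, \qquad \frac{d}{dt}R = \eta W \qquad (2)$$

$$\begin{aligned}
\frac{d}{dt}P[x|y] = {} & \frac{1}{\alpha}\int dx'\,P[x'|y]\left\{\delta[x-x'-\eta\,\mathcal{G}[x',y]] - \delta[x-x']\right\} - \eta\frac{\partial}{\partial x}\left\{P[x|y]\,[U(x-Ry)+Wy]\right\} \\
& + \frac{1}{2}\eta^2 Z\frac{\partial^2}{\partial x^2}P[x|y] - \eta\left[V-RW-(Q-R^2)U\right]\frac{\partial}{\partial x}\left\{P[x|y]\,\Phi[x,y]\right\}
\end{aligned} \qquad (3)$$

For batch learning (which is a deterministic process) we found:

$$\frac{d}{dt}Q = 2\eta V, \qquad \frac{d}{dt}R = \eta W \qquad (4)$$

$$\begin{aligned}
\frac{d}{dt}P[x|y] = {} & -\frac{\eta}{\alpha}\frac{\partial}{\partial x}\left\{P[x|y]\,\mathcal{G}[x,y]\right\} - \eta\frac{\partial}{\partial x}\left\{P[x|y]\,[U(x-Ry)+Wy]\right\} \\
& - \eta\left[V-RW-(Q-R^2)U\right]\frac{\partial}{\partial x}\left\{P[x|y]\,\Phi[x,y]\right\}
\end{aligned} \qquad (5)$$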

From the solution of the above closed sets of equations for the trio $\{Q, R, P\}$ (one of which is a function)
follow the familiar training and generalization errors $E_t = \langle\theta[-xy]\rangle$ and $E_g = \pi^{-1}\arccos[R/\sqrt{Q}]$. The
auxiliary order parameters generated in the replica calculation, the spin-glass order parameter $q$ and
the function $M[x|y]$, are calculated at each time-step by solving the following saddle-point equations:



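$$\langle(x-Ry)^2\rangle + (qQ-R^2)\left(1-\frac{1}{\alpha}\right) = \left[2(qQ-R^2)^{1/2} + \frac{1}{B}\right]\int Dy\,Dz\; z\,\langle x\rangle_\star \qquad (6)$$

$$P[X|y] = \int Dz\;\langle\delta[X-x]\rangle_\star \qquad (7)$$

with

$$B = \frac{\sqrt{qQ-R^2}}{Q(1-q)}, \qquad \langle f[x,y,z]\rangle_\star = \frac{\int dx\; M[x|y]\,e^{Bxz}f[x,y,z]}{\int dx\; M[x|y]\,e^{Bxz}} \qquad (8)$$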
Without loss of generality we can always normalize $M$ according to $\int dx\,M[x|y] = 1$ for all $y \in \Re$.
From the physical meaning of $q$ follows $R^2/Q \leq q \leq 1$. After $q$ and $M[x|y]$ have been determined,
the key function $\Phi[x,y]$ in (1,3,5) is calculated as

$$\Phi[X,y] = \left\{Q(1-q)P[X|y]\right\}^{-1}\int Dz\;\langle X-x\rangle_\star\,\langle\delta[X-x]\rangle_\star \qquad (9)$$

We refer to [1] for full derivations of the above equations.



2.2 Properties of the Macroscopic Laws 

Some useful properties of our theory are independent of which learning rule $\mathcal{G}[x,y]$ is used. The first
of these is that in the limit $\alpha \to \infty$, which corresponds to the case of infinite training sets, our theory
reduces to the simpler formalism initiated in [5] and elaborated in papers like [6, 7], which is built on
assuming $P[x,y]$ to be of a Gaussian form (this only happens as $\alpha \to \infty$) [1]. Secondly, for any given $q$
the solution $M[x|y]$ of the functional saddle-point problem (7) is unique, and can even be obtained as
the fixed-point of a converging nonlinear functional map [1]. Thirdly, we note that the first conditional
moment $\overline{x}(y) = \int dx\,xP[x|y]$ of the joint field distribution obeys a simple equation, which is
obtained from (3) and (5) upon multiplication by $x$, followed by integration over $x$:

$$\frac{d}{dt}\left[\overline{x}(y) - Ry\right] = \frac{\eta}{\alpha}\int dx\;P[x|y]\,\mathcal{G}[x,y] + \eta U\left[\overline{x}(y) - Ry\right] \qquad (10)$$

where we have also used the built-in property $\int dx\,P[x|y]\Phi[x,y] = 0$ for all $y$.



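Alternatively we could rewrite our macroscopic equations in Fourier language, i.e. in terms of
$\hat{P}[k|y] = \int dx\,e^{-ikx}P[x|y]$ and $\hat{M}[k|y] = \int dx\,e^{-ikx}M[x|y]$. The functional saddle-point equation,
giving $\hat{M}[k|y]$, then becomes

$$\hat{P}[k|y] = \int Dz\;\frac{\hat{M}[k+iBz|y]}{\hat{M}[iBz|y]} \qquad (11)$$

and the diffusion equation takes the form

$$\begin{aligned}
\frac{d}{dt}\log\hat{P}[k|y] = {} & \frac{1}{\alpha}\left\{\int\frac{dk'\,dx'}{2\pi}\,\frac{\hat{P}[k'|y]}{\hat{P}[k|y]}\,e^{ix'(k'-k)-i\eta k\mathcal{G}[x',y]} - 1\right\} - i\eta k(W-UR)y \\
& + \eta kU\frac{\partial}{\partial k}\log\hat{P}[k|y] - \frac{1}{2}\eta^2k^2Z - i\eta k\,\frac{V-RW-(Q-R^2)U}{\sqrt{qQ-R^2}\;\hat{P}[k|y]}\int Dz\;z\,\frac{\hat{M}[k+iBz|y]}{\hat{M}[iBz|y]}
\end{aligned} \qquad (12)$$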

This representation will be particularly convenient when applying our theory to Hebbian learning
rules. We can derive more explicit results for the special class of locally Gaussian solutions, defined as

$$P[x|y] = \frac{e^{-\frac{1}{2}[x-\overline{x}(y)]^2/\Delta^2(y)}}{\Delta(y)\sqrt{2\pi}}$$

For such distributions the functional saddle-point equation (7) can be solved, giving

$$M[x|y] = \frac{e^{-\frac{1}{2}[x-\overline{x}(y)]^2/\sigma^2(y)}}{\sigma(y)\sqrt{2\pi}} \qquad (13)$$
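with $\Delta^2(y) = \sigma^2(y) + B^2\sigma^4(y)$. For such solutions to exist, the conditional moments $\overline{x}(y)$ and $\Delta(y)$
must obey the following equation (for all $k$):

$$\begin{aligned}
-ik\frac{d}{dt}\overline{x}(y) - \frac{1}{2}k^2\frac{d}{dt}\Delta^2(y) = {} & \frac{1}{\alpha}\left\{\int\frac{du}{\sqrt{2\pi}}\;e^{-\frac{1}{2}[u+ik\Delta(y)]^2 - ik\eta\,\mathcal{G}[\overline{x}(y)+u\Delta(y),\,y]} - 1\right\} - i\eta k\left\{Wy + U[\overline{x}(y)-Ry]\right\} \\
& - \frac{1}{2}k^2\left[\eta^2Z + 2\eta U\Delta^2(y) + 2\eta\,\sigma^2(y)\,\frac{V-RW-(Q-R^2)U}{Q(1-q)}\right]
\end{aligned} \qquad (14)$$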



The simple form of (13) allows us to calculate many objects explicitly. In particular:

$$\langle x\rangle_\star = \overline{x}(y) + zB\sigma^2(y), \qquad \Phi[x,y] = \frac{x-\overline{x}(y)}{Q(1-q)\left[1+B^2\sigma^2(y)\right]}$$



Finally, for many types of learning rules there are symmetry properties of our macroscopic equations
to be exploited. Note that in the most common types of learning rules no distinction is
made between system errors of the type $\{x > 0,\ y < 0\}$ and those where $\{x < 0,\ y > 0\}$. This
translates into the following property of the function $\mathcal{G}[x,y]$ (note that by the nature of the learning
processes under study $\mathcal{G}[x,y]$ can depend on $y$ only via $\mathrm{sgn}(y)$):

$$\mathcal{G}[x,y] = \mathrm{sgn}(y)\left\{\theta[xy]\,\mathcal{G}_+[|x|] + \theta[-xy]\,\mathcal{G}_-[|x|]\right\} \qquad (15)$$

In particular we find for the most common learning recipes:

Hebbian : $\mathcal{G}_+[u] = 1$, $\mathcal{G}_-[u] = 1$

Perceptron : $\mathcal{G}_+[u] = 0$, $\mathcal{G}_-[u] = 1$

AdaTron : $\mathcal{G}_+[u] = 0$, $\mathcal{G}_-[u] = u$

The form (15) implies the symmetry $\mathcal{G}[-x,y] = -\mathcal{G}[x,-y]$, which turns out to allow for self-consistent
solutions of the macroscopic equations with the following property at any time:

$$P[-x|y] = P[x|-y] \qquad (16)$$

Combination of (16) with (7) and (9) shows that the measure $M[x|y]$ and the function $\Phi[x,y]$ must
consequently have the following symmetry properties:

$$M[-x|y] = M[x|-y], \qquad \Phi[-x,y] = -\Phi[x,-y]$$

This, in turn, guarantees that the right-hand sides of equations (3) and (5) indeed preserve the
symmetry (16) of the field distribution, as claimed.
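
As an illustration of the decomposition (15), the following minimal Python sketch encodes the three recipes through their pairs $(\mathcal{G}_+, \mathcal{G}_-)$ and evaluates $\mathcal{G}[x,y]$. The function and dictionary names (`sgn`, `theta`, `RULES`, `learning_rule`) are our own illustrative choices and do not appear in the original analysis; the step function is taken to vanish at zero argument, which is an assumption of the sketch.

```python
def sgn(u: float) -> float:
    """Sign function, with sgn(0) = 0."""
    return float((u > 0) - (u < 0))


def theta(u: float) -> float:
    """Step function: 1 for u > 0, 0 otherwise (assumed convention at u = 0)."""
    return 1.0 if u > 0 else 0.0


# Pairs (G_plus, G_minus) for the three learning recipes listed in the text.
RULES = {
    "hebbian": (lambda u: 1.0, lambda u: 1.0),
    "perceptron": (lambda u: 0.0, lambda u: 1.0),
    "adatron": (lambda u: 0.0, lambda u: u),
}


def learning_rule(name: str, x: float, y: float) -> float:
    """Evaluate G[x, y] = sgn(y) {theta[xy] G+[|x|] + theta[-xy] G-[|x|]}."""
    g_plus, g_minus = RULES[name]
    return sgn(y) * (theta(x * y) * g_plus(abs(x)) + theta(-x * y) * g_minus(abs(x)))
```

For non-zero fields each rule built in this way satisfies `learning_rule(name, -x, y) == -learning_rule(name, x, -y)`, which is the symmetry that underlies (16).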

3 Benchmark Tests: Hebbian Learning



In the special case of the Hebb rule, $\mathcal{G}[x,y] = \mathrm{sgn}[y]$, where weight changes $\Delta\mathbf{J}$ never depend on
$\mathbf{J}$, one can write down an explicit expression for the weight vector $\mathbf{J}$ at any time, and thus for the
expectation values of our observables. We choose as our initial field distribution a simple Gaussian
one, resulting from an initialization process which did not involve the training set:

$$P_0[x|y] = \frac{e^{-\frac{1}{2}(x-R_0y)^2/(Q_0-R_0^2)}}{\sqrt{2\pi(Q_0-R_0^2)}} \qquad (17)$$



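Careful averaging of the exact expressions for our observables over all 'paths' $\{\boldsymbol{\xi}(0), \boldsymbol{\xi}(1), \ldots\}$ taken by
the question/example vector through the training set $D$ (for on-line learning), followed by averaging
over all realizations of the training set $D$ of size $p = \alpha N$, and taking the $N \to \infty$ limit, then leads to
the following exact result [3]. For on-line Hebbian learning one ends up with:

$$Q = Q_0 + 2\eta tR_0\sqrt{\frac{2}{\pi}} + \eta^2t + \eta^2t^2\left[\frac{1}{\alpha}+\frac{2}{\pi}\right], \qquad R = R_0 + \eta t\sqrt{\frac{2}{\pi}} \qquad (18)$$

$$P[x|y] = \int\frac{d\hat{x}}{2\pi}\;e^{-\frac{1}{2}\hat{x}^2[Q-R^2] + i\hat{x}[x-Ry] + \frac{t}{\alpha}\left[e^{-i\eta\hat{x}\,\mathrm{sgn}[y]}-1\right]} \qquad (19)$$

For batch learning a similar calculation¹ gives:

$$Q = Q_0 + 2\eta tR_0\sqrt{\frac{2}{\pi}} + \eta^2t^2\left[\frac{1}{\alpha}+\frac{2}{\pi}\right], \qquad R = R_0 + \eta t\sqrt{\frac{2}{\pi}} \qquad (20)$$

$$P[x|y] = \frac{e^{-\frac{1}{2}[x-Ry-(\eta t/\alpha)\,\mathrm{sgn}[y]]^2/(Q-R^2)}}{\sqrt{2\pi(Q-R^2)}} \qquad (21)$$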

Neither of the two field distributions is of a fully Gaussian form (although the batch distribution is at
least conditionally Gaussian). Note that for both on-line and batch Hebbian learning we have

$$\int dx\;xP[x|y] = Ry + \frac{\eta t}{\alpha}\,\mathrm{sgn}[y] \qquad (22)$$



The generalization and training errors are, as before, given in terms of the above observables as
$E_g = \pi^{-1}\arccos[R/\sqrt{Q}]$ and $E_t = \int Dy\,dx\,P[x|y]\theta[-xy]$. We thus have exact expressions for both the
generalization error and the training error at any time and for any $\alpha$. The asymptotic values, for both
batch and on-line Hebbian learning, are given by

$$\lim_{t\to\infty}E_g = \frac{1}{\pi}\arccos\left[\frac{1}{\sqrt{1+\pi/2\alpha}}\right] \qquad (23)$$

$$\lim_{t\to\infty}E_t = \frac{1}{2} - \frac{1}{2}\int Dy\;\mathrm{erf}\left[|y|\sqrt{\frac{\alpha}{\pi}} + \frac{1}{\sqrt{2\alpha}}\right] \qquad (24)$$

As far as $E_g$ and $E_t$ are concerned, the differences between batch and on-line Hebbian learning are
confined to transients. Clearly, the above exact results (which can only be obtained for Hebbian-type
learning rules) provide excellent and welcome benchmarks with which to test general theories such as
the one investigated in the present paper.

1 Note that in [3] only the on-line calculation was carried out; the batch calculation can be done along the same lines.
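
The exact expressions for $Q$ and $R$ under Hebbian learning, together with the generalization error and its asymptotic value, are simple enough to evaluate directly. The following Python sketch does so; the function names and default arguments are illustrative assumptions of ours, and the only difference between the on-line and batch cases in this sketch is the extra term $\eta^2 t$ in $Q$.

```python
import math


def hebbian_QR(t, eta, alpha, Q0=1.0, R0=0.0, online=True):
    """Exact Q(t) and R(t) for Hebbian learning with training set size p = alpha*N."""
    c = math.sqrt(2.0 / math.pi)
    R = R0 + eta * t * c
    Q = Q0 + 2.0 * eta * t * R0 * c + eta**2 * t**2 * (1.0 / alpha + 2.0 / math.pi)
    if online:
        Q += eta**2 * t  # term present only for on-line learning
    return Q, R


def generalization_error(Q, R):
    """E_g = arccos(R / sqrt(Q)) / pi."""
    return math.acos(R / math.sqrt(Q)) / math.pi


def asymptotic_generalization_error(alpha):
    """Large-time limit of E_g, identical for on-line and batch learning."""
    return math.acos(1.0 / math.sqrt(1.0 + math.pi / (2.0 * alpha))) / math.pi
```

Evaluating `generalization_error(*hebbian_QR(t, eta, alpha))` for increasing `t` shows the on-line and batch curves converging to the same value, in line with the statement that the two modes differ only in their transients.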






3.1 Batch Hebbian Learning 

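We compare the exact solutions for Hebbian learning to the predictions of our general theory, turning
first to batch Hebbian learning. We insert into the equations of our general formalism the Hebbian
recipe $\mathcal{G}[x,y] = \mathrm{sgn}[y]$. This simplifies our dynamic equations enormously. In particular we obtain:

$$U = 0, \qquad V = \langle x\,\mathrm{sgn}(y)\rangle, \qquad W = \sqrt{2/\pi}$$

For batch learning we consequently find:

$$\frac{d}{dt}Q = 2\eta V, \qquad \frac{d}{dt}R = \eta\sqrt{\frac{2}{\pi}}$$

$$\frac{d}{dt}P[x|y] = -\frac{\eta}{\alpha}\,\mathrm{sgn}(y)\frac{\partial}{\partial x}P[x|y] - \eta y\sqrt{\frac{2}{\pi}}\frac{\partial}{\partial x}P[x|y] - \eta\left(V-R\sqrt{\frac{2}{\pi}}\right)\frac{\partial}{\partial x}\left\{P[x|y]\,\Phi[x,y]\right\}$$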



Given the initial field distribution (17), we immediately obtain $V_0 = R_0\sqrt{2/\pi}$. From the general property
$\int dx\,P[x|y]\Phi[x,y] = 0$ and the above diffusion equation for $P[x|y]$ we derive an equation for the quantity
$V = \langle x\,\mathrm{sgn}(y)\rangle$, resulting in $\frac{d}{dt}V = \eta/\alpha + 2\eta/\pi$, which subsequently allows us to solve

$$Q = Q_0 + 2\eta tR_0\sqrt{\frac{2}{\pi}} + \eta^2t^2\left[\frac{1}{\alpha}+\frac{2}{\pi}\right], \qquad R = R_0 + \eta t\sqrt{\frac{2}{\pi}} \qquad (25)$$



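Furthermore, it turns out that the above diffusion equation for $P[x|y]$ meets the requirements for
having conditionally-Gaussian solutions, i.e.

$$P[x|y] = \frac{e^{-\frac{1}{2}[x-\overline{x}(y)]^2/\Delta^2(y)}}{\Delta(y)\sqrt{2\pi}}, \qquad M[x|y] = \frac{e^{-\frac{1}{2}[x-\overline{x}(y)]^2/\sigma^2(y)}}{\sigma(y)\sqrt{2\pi}}$$

provided the $y$-dependent average $\overline{x}(y)$ and the $y$-dependent variances $\Delta^2(y)$ and $\sigma^2(y)$ obey the following
three coupled equations:

$$\overline{x}(y) = Ry + \frac{\eta t}{\alpha}\,\mathrm{sgn}(y), \qquad \frac{d}{dt}\Delta^2(y) = \frac{2\eta^2t\,\sigma^2(y)}{\alpha Q(1-q)}, \qquad \Delta^2(y) = \sigma^2(y) + B^2\sigma^4(y)$$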



The spin-glass order parameter $q$ is to be solved from the remaining scalar saddle-point equation
(6). With help of identities like $\langle x\rangle_\star = \overline{x}(y) + zB\sigma^2(y)$, which only hold for conditionally-Gaussian
solutions, one can simplify the latter to

$$\frac{\eta^2t^2}{\alpha} + \alpha\int Dy\;\Delta^2(y) + (qQ-R^2)(\alpha-1) = \alpha\left[2\,\frac{qQ-R^2}{Q(1-q)} + 1\right]\int Dy\;\sigma^2(y)$$

We now immediately find the solution

$$\Delta^2(y) = Q-R^2, \qquad \sigma^2(y) = Q(1-q), \qquad q = \left[\alpha R^2 + \eta^2t^2\right]/\alpha Q$$

$$P[x|y] = \frac{e^{-\frac{1}{2}[x-Ry-(\eta t/\alpha)\,\mathrm{sgn}(y)]^2/(Q-R^2)}}{\sqrt{2\pi(Q-R^2)}} \qquad (26)$$
(this solution is unique). If we calculate the generalization error and the training error from (25) and
(26), respectively, we recover the exact expressions

$$E_g = \frac{1}{\pi}\arccos\left[\frac{R_0+\eta t\sqrt{2/\pi}}{\sqrt{Q_0+2\eta tR_0\sqrt{2/\pi}+\eta^2t^2\left(\frac{1}{\alpha}+\frac{2}{\pi}\right)}}\right] \qquad (27)$$

$$E_t = \frac{1}{2} - \frac{1}{2}\int Dy\;\mathrm{erf}\left[\frac{|y|\left[R_0+\eta t\sqrt{2/\pi}\right] + \eta t/\alpha}{\sqrt{2\left[Q_0-R_0^2+\eta^2t^2/\alpha\right]}}\right] \qquad (28)$$

Comparison of (25,26) with (20,21) shows that for batch Hebbian learning our theory is fully exact.
This is not a big feat as far as $Q$ and $R$ (and thus $E_g$) are concerned, whose determination did not
require knowing the function $\Phi[x,y]$. The fact that our theory also gives the exact values for $P[x|y]$
and $E_t$, however, is less trivial, since here the disordered nature of the learning dynamics, leading to
non-Gaussian distributions, is truly relevant.
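
The training error of batch Hebbian learning involves a single Gaussian integral over an error function, which can be evaluated numerically without difficulty. The sketch below does this with a plain Riemann sum on a uniform grid; the quadrature routine, its cut-off `y_max`, the grid size `n` and all function names are illustrative assumptions of ours rather than part of the original analysis.

```python
import math


def gaussian_average(f, y_max=8.0, n=4001):
    """Approximate the Gaussian average of f(y), with y ~ N(0,1), by a Riemann sum."""
    h = 2.0 * y_max / (n - 1)
    total = 0.0
    for i in range(n):
        y = -y_max + i * h
        total += math.exp(-0.5 * y * y) * f(y)
    return total * h / math.sqrt(2.0 * math.pi)


def batch_training_error(t, eta, alpha, Q0=1.0, R0=0.0):
    """E_t(t) for batch Hebbian learning, from the conditionally Gaussian P[x|y]."""
    R = R0 + eta * t * math.sqrt(2.0 / math.pi)
    var = Q0 - R0**2 + eta**2 * t**2 / alpha  # Q - R^2 for batch learning
    shift = eta * t / alpha                   # displacement of the conditional mean
    return 0.5 - 0.5 * gaussian_average(
        lambda y: math.erf((abs(y) * R + shift) / math.sqrt(2.0 * var)))


def asymptotic_training_error(alpha):
    """Large-time limit of E_t."""
    return 0.5 - 0.5 * gaussian_average(
        lambda y: math.erf(abs(y) * math.sqrt(alpha / math.pi) + 1.0 / math.sqrt(2.0 * alpha)))
```

For large `t` the value returned by `batch_training_error` approaches `asymptotic_training_error(alpha)`, while at `t = 0` with `R0 = 0` it equals 1/2, as expected for an untrained student.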



3.2 On-Line Hebbian Learning 

We next insert the Hebbian recipe $\mathcal{G}[x,y] = \mathrm{sgn}[y]$ into the on-line equations (2,3). Direct analytical
solution of these equations, or a demonstration that they are solved by the exact result (18,19),
although not ruled out, has not yet been achieved by us. The reason is that here one has conditionally
Gaussian field distributions only in special limits. Numerical solution is in principle straightforward,
but will be quite CPU intensive (see also a subsequent section). For small learning rates the on-line
equations reduce to the batch ones, so we know that to first order in $\eta$ our on-line equations are exact
(for any $\alpha$, $t$). We now show that the predictions of our theory are fully exact (i) for $Q$, $R$ and $E_g$,
(ii) for the first moment (22) of the conditional field distribution, and (iii) for all order parameters in
the stationary state. At intermediate times we construct an approximate solution of our equations in
order to obtain predictions for $P[x|y]$ and $E_t$.

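As before we choose a Gaussian initial field distribution. Many (but not all) of our previous
simplifications still hold, e.g.

$$U = 0, \qquad V = \langle x\,\mathrm{sgn}(y)\rangle, \qquad W = \sqrt{2/\pi}, \qquad Z = 1$$

($Z$ did not occur in the batch equations). Thus for on-line learning we find:

$$\frac{d}{dt}Q = 2\eta V + \eta^2, \qquad \frac{d}{dt}R = \eta\sqrt{\frac{2}{\pi}}$$

$$\begin{aligned}
\frac{d}{dt}P[x|y] = {} & \frac{1}{\alpha}\left\{P[x-\eta\,\mathrm{sgn}(y)|y] - P[x|y]\right\} - \eta y\sqrt{\frac{2}{\pi}}\frac{\partial}{\partial x}P[x|y] + \frac{1}{2}\eta^2\frac{\partial^2}{\partial x^2}P[x|y] \\
& - \eta\left[V-R\sqrt{\frac{2}{\pi}}\right]\frac{\partial}{\partial x}\left\{P[x|y]\,\Phi[x,y]\right\}
\end{aligned}$$

The previous derivation of the identities $\frac{d}{dt}V = \eta/\alpha + 2\eta/\pi$ and $V_0 = R_0\sqrt{2/\pi}$ still applies (just replace
the batch diffusion equation by the on-line one), but the resultant expression for $Q$ is different. Here
we obtain:

$$Q = Q_0 + 2\eta tR_0\sqrt{\frac{2}{\pi}} + \eta^2t + \eta^2t^2\left[\frac{1}{\alpha}+\frac{2}{\pi}\right], \qquad R = R_0 + \eta t\sqrt{\frac{2}{\pi}} \qquad (29)$$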



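Comparing (29) with (18) reveals that also for on-line Hebbian learning our theory is exact with regard
to $Q$ and $R$, and thus also with regard to $E_g$. Upon using $V = \eta t/\alpha + R\sqrt{2/\pi}$, the on-line diffusion
equation simplifies to

$$\frac{d}{dt}P[x|y] = \frac{1}{\alpha}\left\{P[x-\eta\,\mathrm{sgn}(y)|y] - P[x|y]\right\} - \eta y\sqrt{\frac{2}{\pi}}\frac{\partial}{\partial x}P[x|y] + \frac{1}{2}\eta^2\frac{\partial^2}{\partial x^2}P[x|y] - \frac{\eta^2t}{\alpha}\frac{\partial}{\partial x}\left\{P[x|y]\,\Phi[x,y]\right\}$$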
Multiplication of this equation by $x$ followed by integration over $x$, together with usage of the general
properties $\int dx\,\{P[x|y]\Phi[x,y]\} = 0$ and $\int dx\,xP_0[x|y] = R_0y$, gives us the average of the conditional
distribution $P[x|y]$ at any time:

$$\int dx\;xP[x|y] = Ry + \frac{\eta t}{\alpha}\,\mathrm{sgn}[y]$$

Comparison with (22) shows also this prediction to be correct.

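We now turn to observables which involve more detailed knowledge of the function $\Phi[x,y]$. Our
result for $\overline{x}(y)$ and the identity $\langle x\rangle_\star = B^{-1}\frac{\partial}{\partial z}\log\hat{M}[iBz|y]$ allow us to rewrite all remaining equations
in Fourier representation, i.e. in terms of $\hat{P}[k|y] = \int dx\,e^{-ikx}P[x|y]$ and $\hat{M}[k|y] = \int dx\,e^{-ikx}M[x|y]$:

$$\frac{d}{dt}\log\hat{P}[k|y] = \frac{1}{\alpha}\left[e^{-i\eta k\,\mathrm{sgn}(y)} - 1\right] - i\eta ky\sqrt{\frac{2}{\pi}} - \frac{1}{2}\eta^2k^2 - \frac{ik\eta^2t}{\alpha\hat{P}[k|y]\sqrt{qQ-R^2}}\int Dz\;z\,\frac{\hat{M}[k+iBz|y]}{\hat{M}[iBz|y]} \qquad (30)$$

with $\log\hat{P}_0[k|y] = -ikR_0y - \frac{1}{2}k^2(Q_0-R_0^2)$, and with the two saddle-point equations

$$\hat{P}[k|y] = \int Dz\;\frac{\hat{M}[k+iBz|y]}{\hat{M}[iBz|y]} \qquad (31)$$

$$\frac{\eta^2t^2}{\alpha^2} + \int Dy\int dx\;P[x|y]\,[x-\overline{x}(y)]^2 + \left(1-\frac{1}{\alpha}\right)(qQ-R^2) = \left[2Q(1-q) + \frac{1}{B^2}\right]\int Dy\,Dz\;\frac{\partial^2}{\partial z^2}\log\hat{M}[iBz|y] \qquad (32)$$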

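Since the fields $x$ grow linearly in time (see our expression for $\overline{x}(y)$) the equations (30,31,32) cannot
have proper $t \to \infty$ limits. To extract asymptotic properties we have to turn to the rescaled distribution
$\hat{Q}_t[k|y] = \hat{P}_t[k/t|y]$. We define $v(y) = (\eta/\alpha)\,\mathrm{sgn}(y) + \eta y\sqrt{2/\pi}$. Careful integration of (30), followed by
inserting $k \to k/t$ and by taking the limit $t \to \infty$, produces:

$$\log\hat{Q}_\infty[k|y] = -ikv(y) - \frac{i\eta^2k}{\alpha}\int_0^1 du\;\lim_{t\to\infty}\frac{t}{\sqrt{qQ-R^2}}\int Dz\;z\,\frac{\hat{M}[uk/t+iBz|y]}{\hat{Q}_\infty[uk|y]\,\hat{M}[iBz|y]} \qquad (33)$$

with the functional saddle-point equation

$$\hat{Q}_\infty[k|y] = \lim_{t\to\infty}\int Dz\;\frac{\hat{M}[k/t+iBz|y]}{\hat{M}[iBz|y]} \qquad (34)$$

The rescaled asymptotic system (33,34) admits the solution

$$\hat{Q}_\infty[k|y] = e^{-ikv(y)-\frac{1}{2}k^2\Delta^2}, \qquad \hat{M}[k|y] = e^{-ik\overline{x}(y)-\frac{1}{2}k^2\sigma^2t}$$

with the asymptotic values of $B$, $\Delta$, $\sigma$ and $q$ determined by solving the following equations:

$$\Delta = B\sigma^2, \qquad \Delta = \frac{\eta^2}{\alpha}\lim_{t\to\infty}\frac{t}{\sqrt{qQ-R^2}}, \qquad B = \lim_{t\to\infty}\frac{\sqrt{qQ-R^2}}{Q(1-q)}$$

$$\eta^2/\alpha^2 + \Delta^2 + (1-\alpha^{-1})\lim_{t\to\infty}(qQ-R^2)/t^2 = 2B^2\sigma^2\lim_{t\to\infty}Q(1-q)/t$$

Inspection shows that these four asymptotic equations are solved by

$$\lim_{t\to\infty}\Delta = \eta/\sqrt{\alpha}, \qquad \lim_{t\to\infty}q = 1$$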
Figure 1: On-line Hebbian learning, simulations versus theoretical predictions, for $\eta = 1$ and
$\alpha \in \{0.25, 0.5, 1.0, 2.0, 4.0\}$ ($N = 10,000$). Upper curves: generalization errors as functions of time. Lower
curves: training errors as functions of time. Circles: simulation results for $E_g$; diamonds: simulation
results for $E_t$. Solid lines: corresponding predictions of dynamical replica theory.



so that

$$\lim_{t\to\infty}\hat{P}_t[k/t|y] = e^{-ik\left[\eta\alpha^{-1}\mathrm{sgn}(y)+\eta y\sqrt{2/\pi}\right]-\frac{1}{2}\eta^2k^2/\alpha} \qquad (35)$$

Comparison with (18,19) shows that this prediction (35) is again exact. Thus the same is true for the
asymptotic training error.

Finally, in order to arrive at predictions with respect to $P[x|y]$ and $E_t$ for intermediate times
(without rigorous analytical solution of the functional saddle-point equation), and in view of the
conditionally-Gaussian form of the field distribution both at $t = 0$ and at $t = \infty$, it would appear to
make sense for us to approximate $P[x|y]$ and $M[x|y]$ by simple conditionally Gaussian distributions
at any time:

$$P[x|y] = \frac{e^{-\frac{1}{2}[x-\overline{x}(y)]^2/\Delta^2}}{\Delta\sqrt{2\pi}}, \qquad M[x|y] = \frac{e^{-\frac{1}{2}[x-\overline{x}(y)]^2/\sigma^2}}{\sigma\sqrt{2\pi}}$$

with the (exact) first moments $\overline{x}(y) = Ry + \eta t\alpha^{-1}\mathrm{sgn}(y)$, and with the variance $\Delta^2$ self-consistently
given by the solution of:

$$\Delta^2 = \sigma^2 + B^2\sigma^4, \qquad B = \frac{\sqrt{qQ-R^2}}{Q(1-q)}, \qquad \frac{d}{dt}\Delta^2 = \frac{\eta^2}{\alpha} + \eta^2 + \frac{2\eta^2t\,\sigma^2}{\alpha Q(1-q)}$$

$$\alpha\Delta^2 + \frac{\eta^2t^2}{\alpha} + (qQ-R^2)(\alpha-1) = \alpha\sigma^2\left[2\,\frac{qQ-R^2}{Q(1-q)} + 1\right]$$
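The solution of the above coupled equations behaves as

$$\Delta^2 = Q - R^2 + \eta^2t/\alpha + O(t^3) \quad (t \to 0), \qquad \Delta^2 = (Q-R^2)\left[1+O(t^{-1})\right] \quad (t \to \infty)$$

for short and long times, respectively (note $Q-R^2 \sim t^2$ as $t \to \infty$). Thus we obtain a simple approximate
solution of our equations, which extrapolates between exact results at the temporal boundaries $t = 0$
and $t = \infty$, by putting

$$\Delta^2 = Q - R^2 + \eta^2t/\alpha$$

with $Q$ and $R$ given by our previous exact result (29). One obtains

$$E_g = \frac{1}{\pi}\arccos\left[\frac{R}{\sqrt{Q}}\right], \qquad E_t = \frac{1}{2} - \frac{1}{2}\int Dy\;\mathrm{erf}\left[\frac{|y|R+\eta t/\alpha}{\Delta\sqrt{2}}\right] \qquad (36)$$

We can also calculate the student field distribution $P(x) = \int Dy\,P[x|y]$, giving

$$\begin{aligned}
P(x) = {} & \frac{e^{-\frac{1}{2}[x+\eta t/\alpha]^2/(\Delta^2+R^2)}}{2\sqrt{2\pi(\Delta^2+R^2)}}\left\{1 - \mathrm{erf}\left[\frac{R[x+\eta t/\alpha]}{\Delta\sqrt{2(\Delta^2+R^2)}}\right]\right\} \\
& + \frac{e^{-\frac{1}{2}[x-\eta t/\alpha]^2/(\Delta^2+R^2)}}{2\sqrt{2\pi(\Delta^2+R^2)}}\left\{1 + \mathrm{erf}\left[\frac{R[x-\eta t/\alpha]}{\Delta\sqrt{2(\Delta^2+R^2)}}\right]\right\}
\end{aligned} \qquad (37)$$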



(37) 



In figure 1 we compare the predictions for the generalization and training errors (36) of the approximate
solution of our equations with the results obtained from numerical simulations of on-line
Hebbian learning for $N = 10,000$ (initial state: $Q_0 = 1$, $R_0 = 0$; learning rate: $\eta = 1$). All curves
show excellent agreement between theory and experiment. For $E_g$ this is guaranteed by the exactness
of our theory for $Q$ and $R$; the agreement found for $E_t$ is more surprising, in that these predictions
are obtained from a simple approximation of the solution of our equations. We also compare the theoretical
predictions made for the distribution $P[x|y]$ with the results of numerical simulations. This
is done in figure 2, where we show the fields as observed at time $t = 50$ in simulations ($N = 10,000$,
$\eta = 1$, $R_0 = 0$, $Q_0 = 1$) of on-line Hebbian learning, for three different values of $\alpha$. In the same
figure we draw (as dashed lines) the theoretical prediction (22) for the $y$-dependent average of the
conditional $x$-distribution $P[x|y]$. Finally we compare the student field distribution $P(x)$, as observed
in simulations of on-line Hebbian learning ($N = 10,000$, $\eta = 1$, $R_0 = 0$, $Q_0 = 1$) with our prediction
(37). The result is shown in figure 3, for $\alpha \in \{4, 1, 0.25\}$. In all cases the agreement between theory
and experiment, even for the approximate solution of our equations, is quite satisfactory.
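
A comparison of this kind can be reproduced on a much smaller scale with a few lines of Python. The sketch below is not the simulation code used for the figures: the system size, the random number handling, the update normalisation $\Delta\mathbf{J} = (\eta/N)\,\boldsymbol{\xi}\,\mathrm{sgn}(\mathbf{B}\cdot\boldsymbol{\xi})$ with time step $1/N$, and all function names are assumptions made for illustration. At such small $N$ finite size effects are clearly visible, so only rough agreement with the exact expressions for $Q$ and $R$ should be expected.

```python
import math
import random


def simulate_online_hebbian(N=400, alpha=1.0, eta=1.0, t_max=5.0, seed=0):
    """Small-scale on-line Hebbian learning with a fixed training set of size alpha*N."""
    rng = random.Random(seed)
    p = int(alpha * N)
    # Teacher B and initial student J, both normalised to unit length.
    B = [rng.gauss(0.0, 1.0) for _ in range(N)]
    norm = math.sqrt(sum(b * b for b in B))
    B = [b / norm for b in B]
    J = [rng.gauss(0.0, 1.0) for _ in range(N)]
    norm = math.sqrt(sum(j * j for j in J))
    J = [j / norm for j in J]
    Q0 = sum(j * j for j in J)
    R0 = sum(b * j for b, j in zip(B, J))
    # Fixed training set of binary questions, with the teacher's answers.
    D = [[rng.choice((-1.0, 1.0)) for _ in range(N)] for _ in range(p)]
    T = [1.0 if sum(b * x for b, x in zip(B, xi)) > 0 else -1.0 for xi in D]
    for _ in range(int(t_max * N)):  # each update advances time by 1/N
        mu = rng.randrange(p)
        step = eta * T[mu] / N
        xi = D[mu]
        for i in range(N):
            J[i] += step * xi[i]
    Q = sum(j * j for j in J)
    R = sum(b * j for b, j in zip(B, J))
    return Q0, R0, Q, R


def theory_online_hebbian(t, eta, alpha, Q0, R0):
    """Exact on-line expressions for Q(t) and R(t) in the limit of large N."""
    c = math.sqrt(2.0 / math.pi)
    R = R0 + eta * t * c
    Q = Q0 + 2.0 * eta * t * R0 * c + eta**2 * t + eta**2 * t**2 * (1.0 / alpha + 2.0 / math.pi)
    return Q, R
```

Calling `simulate_online_hebbian()` and passing the returned `Q0`, `R0` to `theory_online_hebbian(5.0, 1.0, 1.0, Q0, R0)` gives the measured and predicted values of $Q$ and $R$ side by side; increasing `N` should reduce the remaining discrepancy.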
Figure 2: Comparison between simulation results for on-line Hebbian learning (system size $N = 10,000$)
and dynamical replica theory, for $\eta = 1$ and $\alpha \in \{0.5, 1.0, 2.0\}$. Dots: local fields $(x,y) =
(\mathbf{J}\cdot\boldsymbol{\xi}, \mathbf{B}\cdot\boldsymbol{\xi})$ (calculated for questions in the training set), at time $t = 50$. Dashed lines: conditional
average of student field $x$ as a function of $y$, as predicted by the theory, $\overline{x}(y) = Ry + (\eta t/\alpha)\,\mathrm{sgn}(y)$.
Figure 3: Simulations of on-line Hebbian learning with $\eta = 1$ and $N = 10,000$. Histograms: student
field distributions measured at $t = 10$ and $t = 20$. Lines: theoretical predictions for student field
distributions, $\alpha = 4$ (upper), $\alpha = 1$ (middle), $\alpha = 0.25$ (lower).

4 General Approximation Schemes



All three approximation schemes presented in this section aim at providing alternatives to calculating
the effective measure $M[x|y]$ at each time step from the functional saddle-point equation. Since this
calculation cannot (yet) be done analytically, it constitutes a significant numerical obstacle in working
out the predictions of our theory. Each scheme preserves both normalisation and symmetries of the
probability density $P[x,y]$ and its marginals, as well as the relation $\int dx\,P[x|y]\Phi[x,y] = 0$ for all $y$.
In the first two approximation schemes, a large $\alpha$ expansion and a conditionally-Gaussian saddle-point
approximation, all Gaussian integrals representing the disorder in the problem can be done
analytically; this leads to a significant reduction in CPU time when solving our equations numerically
(the large $\alpha$ approximation in particular is extremely simple and fast, as it does not even involve a
saddle-point equation for $q$). We only work out the equations for on-line learning; the batch laws follow as
usual upon expanding the equations in powers of $\eta$ and retaining only the linear terms.



4.1 Large a Approximation 

Our first approximation scheme is obtained upon taking into account the finite nature of the training
set (i.e. the disordered nature of the dynamics) in first non-trivial order. The amount of disorder is
effectively measured by the parameter $B$, or, equivalently, by the deviation of the value of the spin-glass
order parameter $q$ from its naive value $R^2/Q$. Putting $B = 0$ in the saddle-point equation (7)
immediately gives $\lim_{B\to 0}M[x|y] = P[x|y]$, so we write

$$M[x|y] = P[x|y]\left[1 + \sum_{\ell\geq 1}B^\ell m_\ell[x|y]\right], \qquad \int dx\;P[x|y]\,m_\ell[x|y] = 0 \qquad (38)$$

Upon inserting (38) as an ansatz into the saddle-point equation (7), one easily shows that

$$M[x|y] = P[x|y]\;e^{-\frac{1}{2}B^2[x-\overline{x}(y)]^2 + \frac{1}{2}B^2\left[\overline{x^2}(y)-\overline{x}^2(y)\right] + O(B^3)} \qquad (39)$$

with the abbreviations

$$\overline{x}(y) = \int dx\;P[x|y]\,x, \qquad \overline{x^2}(y) = \int dx\;P[x|y]\,x^2$$

(the second $O(B^2)$ term in the exponent of (39), being independent of $x$, just reflects the normalisation
requirements). This result enables us, in turn, to expand the function $\Phi[x,y]$ which controls the non-trivial
term in our diffusion equation for $P[x|y]$. Note that from the definition of $B$ it follows that
$Q(1-q) = \frac{1}{2}B^{-2}\left[\sqrt{1+4B^2(Q-R^2)}-1\right]$, which gives, to leading order in $B$,

$$\Phi[x,y] = \frac{x-\overline{x}(y)}{Q-R^2}$$

With this expression we can write our approximate equations in explicitly closed form (i.e. without
any remaining saddle-point equations). The relevant scalar functions become

$$U = \frac{\langle\mathcal{G}[x,y]\,[x-\overline{x}(y)]\rangle}{Q-R^2}, \qquad V = \langle x\,\mathcal{G}[x,y]\rangle, \qquad W = \langle y\,\mathcal{G}[x,y]\rangle, \qquad Z = \langle\mathcal{G}^2[x,y]\rangle \qquad (40)$$



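For on-line learning we find:

$$\frac{d}{dt}Q = 2\eta V + \eta^2Z, \qquad \frac{d}{dt}R = \eta W \qquad (41)$$

$$\begin{aligned}
\frac{d}{dt}P[x|y] = {} & \frac{1}{\alpha}\int dx'\,P[x'|y]\left\{\delta[x-x'-\eta\,\mathcal{G}[x',y]] - \delta[x-x']\right\} - \eta\frac{\partial}{\partial x}\left\{P[x|y]\,[U(x-Ry)+Wy]\right\} \\
& + \frac{1}{2}\eta^2Z\frac{\partial^2}{\partial x^2}P[x|y] - \eta\left[\frac{V-RW}{Q-R^2} - U\right]\frac{\partial}{\partial x}\left\{P[x|y]\,[x-\overline{x}(y)]\right\}
\end{aligned} \qquad (42)$$

From the solution of the above equations follow, as always, the training and generalization errors
$E_t = \int Dy\,dx\,P[x|y]\theta[-xy]$ and $E_g = \pi^{-1}\arccos[R/\sqrt{Q}]$. The resulting theory is obviously exact in
the limit $\alpha \to \infty$ (see [1]), by construction.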
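
The relation between $Q(1-q)$ and $B$ quoted above is the positive root of the quadratic equation $B^2s^2 + s = Q-R^2$ for $s = Q(1-q)$, which follows directly from the definition of $B$. A minimal Python sketch of this step is given below; the function name `positive_root` is our own, and the algebraically equivalent form $2c/(1+\sqrt{1+4B^2c})$ is used instead of the expression in the text because it is assumed to behave better numerically for small $B$.

```python
import math


def positive_root(B: float, c: float) -> float:
    """Positive solution s of B^2 s^2 + s = c, for c >= 0.

    Algebraically equal to [sqrt(1 + 4 B^2 c) - 1] / (2 B^2), but written in a
    form that avoids the cancellation occurring in that expression for small B.
    """
    return 2.0 * c / (1.0 + math.sqrt(1.0 + 4.0 * B * B * c))


# Consistency check with hypothetical values of Q, R and q:
Q, R, q = 2.0, 0.5, 0.6
B = math.sqrt(q * Q - R * R) / (Q * (1.0 - q))  # definition of B
s = positive_root(B, Q - R * R)                 # should reproduce Q(1 - q)
```

The same quadratic form appears in the relation $\Delta^2 = \sigma^2 + B^2\sigma^4$ for locally Gaussian solutions, so `positive_root(B, Delta2)` also returns $\sigma^2$ for a given variance `Delta2`.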

4.2 Conditionally-Gaussian Approximation 

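Our basic idea here is a variational approach to solving the functional saddle-point problem (valid for
any $\alpha$), i.e. to carry out the functional extremisation only within the restricted family of conditionally
Gaussian measures $M[x|y]$ (which, together with $q$, characterises the saddle-point):

$$M[x|y] = \frac{e^{-\frac{1}{2}[x-\overline{x}(y)]^2/\sigma^2(y)}}{\sigma(y)\sqrt{2\pi}}$$

Note that this does not imply the stronger statement that $P[x|y]$ itself is taken to be of a conditionally-Gaussian
form (as in the case of the approximation used for on-line Hebbian learning). Extremisation
of the original replica-symmetric functional $\Psi[q, \{M\}]$ (see [1]) within the conditionally-Gaussian family
of functions results in the requirement that the two $y$-dependent moments $\overline{x}(y)$ and $\sigma^2(y)$ be given
by

$$\overline{x}(y) = \int dx\;xP[x|y] \qquad (43)$$

$$\Delta^2(y) = \int dx\;x^2P[x|y] - \overline{x}^2(y) = \sigma^2(y) + B^2\sigma^4(y) \qquad (44)$$

Now we can again calculate all relevant averages which involve the effective measure $M[x|y]$ exactly.
In particular:

$$\langle x\rangle_\star = \overline{x}(y) + zB\sigma^2(y), \qquad B = \frac{\sqrt{qQ-R^2}}{Q(1-q)}, \qquad \Phi[x,y] = \frac{e^{-\frac{1}{2}[x-\overline{x}(y)]^2/\Delta^2(y)}}{\Delta(y)\sqrt{2\pi}\,P[x|y]}\;\frac{(x-\overline{x}(y))\,\sigma^2(y)}{Q(1-q)\Delta^2(y)} \qquad (45)$$

For on-line learning this results in the following approximated theory:

$$U = \int Dy\,Du\;\frac{u\,\sigma^2(y)\,\mathcal{G}[\overline{x}(y)+u\Delta(y),\,y]}{Q(1-q)\Delta(y)}, \qquad V = \langle x\,\mathcal{G}[x,y]\rangle, \qquad W = \langle y\,\mathcal{G}[x,y]\rangle, \qquad Z = \langle\mathcal{G}^2[x,y]\rangle$$

$$\frac{d}{dt}Q = 2\eta V + \eta^2Z, \qquad \frac{d}{dt}R = \eta W$$

$$\begin{aligned}
\frac{d}{dt}P[x|y] = {} & \frac{1}{\alpha}\int dx'\,P[x'|y]\left\{\delta[x-x'-\eta\,\mathcal{G}[x',y]] - \delta[x-x']\right\} - \eta\frac{\partial}{\partial x}\left\{P[x|y]\,[U(x-Ry)+Wy]\right\} \\
& + \frac{1}{2}\eta^2Z\frac{\partial^2}{\partial x^2}P[x|y] - \eta\,\frac{\sigma^2(y)\left[V-RW-(Q-R^2)U\right]}{\sqrt{2\pi}\,Q(1-q)\Delta^5(y)}\left[\Delta^2(y)-(x-\overline{x}(y))^2\right]e^{-\frac{1}{2}[x-\overline{x}(y)]^2/\Delta^2(y)}
\end{aligned}$$

The remaining order parameter $q$ is calculated at each time-step by solving

$$\langle(x-Ry)^2\rangle + (qQ-R^2)\left(1-\frac{1}{\alpha}\right) = \left[2\,\frac{qQ-R^2}{Q(1-q)} + 1\right]\int Dy\;\sigma^2(y)$$

From the solution of these equations follow the training and generalization errors $E_t = \int Dy\,dx\,P[x|y]\theta[-xy]$
and $E_g = \pi^{-1}\arccos[R/\sqrt{Q}]$.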

4.3 Partially Annealed Approximation



In order to construct our third and final approximation we return to an earlier stage of the derivation
of the present formalism (see [1]), and rewrite the functional saddle-point equation in a form where
the replica limit $n\to 0$ has not yet been taken, i.e.

$$\text{for all } x,y:\qquad P[x|y] = \frac{\int\! Dz\; M_n[x|y]\,e^{Bz[x-\overline{x}(y)]}\Big[\int\! dx'\,M_n[x'|y]\,e^{Bz[x'-\overline{x}(y)]}\Big]^{n-1}}{\int\! Dz\,\Big[\int\! dx'\,M_n[x'|y]\,e^{Bz[x'-\overline{x}(y)]}\Big]^{n}}$$

with $\overline{x}(y) = \int\! dx\; x P[x|y]$. In our full (quenched disorder) calculation we find ourselves with the
effective measure $M[x|y] = \lim_{n\to 0} M_n[x|y]$. In contrast, an alternative calculation, whereby the
quenched average over all training sets would have been replaced by an annealed average over all
training sets, would have led us to the value n = 1 rather than n = 0: $M[x|y] = M_1[x|y]$. We can
now define in a natural way an annealed approximation of our theory upon replacing the complicated
n = 0 functional saddle-point equation ([?]) by the much simpler n = 1 version:

$$P[x|y] = \frac{\int\! Dz\; M[x|y]\,e^{Bz[x-\overline{x}(y)]}}{\int\! Dz\int\! dx'\,M[x'|y]\,e^{Bz[x'-\overline{x}(y)]}}$$

The z-integrations can immediately be carried out, and the resulting equation solved for $M[x|y]$,
giving:

$$M[x|y] = \frac{P[x|y]\,e^{-\frac{1}{2}B^2[x-\overline{x}(y)]^2}}{\int\! dx'\,P[x'|y]\,e^{-\frac{1}{2}B^2[x'-\overline{x}(y)]^2}} \tag{46}$$
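The step leading to (46) uses only the Gaussian identity $\int\! Dz\, e^{Bza} = e^{\frac{1}{2}B^2a^2}$. As an
illustration of how (46) could be evaluated in practice, the Python sketch below computes M[x|y] for
one fixed y on a uniform grid of student fields. The grid, the simple Riemann sums and the function
name are our own assumptions and are not taken from the paper.

```python
import math

def annealed_measure(xs, P, B):
    """Evaluate eq. (46) on a uniform grid xs, given values P of P[x|y] for one fixed y.

    Returns M[x|y] on the same grid, normalised with respect to the grid spacing.
    """
    dx = xs[1] - xs[0]
    norm = sum(P) * dx
    xbar = sum(x * p for x, p in zip(xs, P)) * dx / norm       # conditional mean of x
    w = [p * math.exp(-0.5 * B * B * (x - xbar) ** 2) for x, p in zip(xs, P)]
    total = sum(w) * dx
    return [wi / total for wi in w]

# Hypothetical test case: a unit-variance Gaussian P[x|y] and B = 1.
xs = [-8.0 + 0.01 * i for i in range(1601)]
P = [math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi) for x in xs]
M = annealed_measure(xs, P, B=1.0)
variance = sum(x * x * m for x, m in zip(xs, M)) * 0.01
print(variance)
```

For a Gaussian P[x|y] with variance $\Delta^2$ the measure M[x|y] is again Gaussian, with the reduced
variance $1/(\Delta^{-2}+B^2)$, which provides a simple check of any numerical implementation.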
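Averages involving the effective measure $M[x|y]$ are thus written explicitly in terms of $P[x|y]$, and we
are left with the following approximate theory:

$$U = \langle\Phi[x,y]\mathcal{G}[x,y]\rangle, \qquad V = \langle x\,\mathcal{G}[x,y]\rangle, \qquad W = \langle y\,\mathcal{G}[x,y]\rangle, \qquad Z = \langle \mathcal{G}^2[x,y]\rangle \tag{47}$$

$$\frac{d}{dt}Q = 2\eta V + \eta^2 Z, \qquad \frac{d}{dt}R = \eta W \tag{48}$$

$$\frac{d}{dt}P[x|y] = \frac{1}{\alpha}\int\! dx'\,P[x'|y]\Big\{\delta[x-x'-\eta\mathcal{G}[x',y]]-\delta[x-x']\Big\} - \eta\frac{\partial}{\partial x}\Big\{P[x|y]\big[U(x-Ry)+Wy\big]\Big\}$$
$$\qquad\qquad + \frac{1}{2}\eta^2 Z\frac{\partial^2}{\partial x^2}P[x|y] - \eta\big[V-RW-(Q-R^2)U\big]\frac{\partial}{\partial x}\Big\{P[x|y]\Phi[x,y]\Big\} \tag{49}$$

with

$$\Phi[X,y] = \frac{1}{Q(1-q)}\int\! Dz\;\frac{\int\! dx\; P[x|y]\,(X-x)\,e^{-\frac{1}{2}[B(x-\overline{x}(y))-z]^2-\frac{1}{2}[B(X-\overline{x}(y))-z]^2}}{\Big\{\int\! dx\; P[x|y]\,e^{-\frac{1}{2}[B(x-\overline{x}(y))-z]^2}\Big\}^2}$$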

As always, $B = \sqrt{qQ-R^2}/[Q(1-q)]$. The remaining spin-glass order parameter q is calculated at each
time-step by solving

$$\langle (x-Ry)^2\rangle + (qQ-R^2)\Big(1-\frac{1}{\alpha}\Big) = \Big[2(qQ-R^2)^{\frac{1}{2}}+\frac{1}{B}\Big]\int\! Dy\,Dz\; z\;\frac{\int\! dx\; P[x|y]\,x\,e^{-\frac{1}{2}[B(x-\overline{x}(y))-z]^2}}{\int\! dx\; P[x|y]\,e^{-\frac{1}{2}[B(x-\overline{x}(y))-z]^2}}$$

From the solution of the above equations follow the training- and generalisation errors $E_t = \langle\theta[-xy]\rangle$
and $E_g = \pi^{-1}\arccos[R/\sqrt{Q}]$. It should be emphasised that the present approximation is not equivalent
to (and should be more accurate than) a full annealed treatment of the disorder in the problem; the
latter would have affected not only the equation for $M[x|y]$ but also the saddle-point equation for q
(hence the name partially annealed approximation).
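The nested Gaussian integrals in the function $\Phi$ of the partially annealed theory are its main
numerical cost. The Python sketch below shows one possible way of evaluating $\Phi[X,y]$ for a single
fixed y. It is a minimal illustration under our own assumptions (uniform grids for x and z, plain
Riemann sums, a cut-off z_max for the Gaussian z-integration, and a P[x|y] that is already normalised
on the grid); it is not the numerical procedure used for the results in this paper.

```python
import math

def phi_annealed(X, xs, P, B, Q, q, z_max=8.0, nz=321):
    """Phi[X, y] of the partially annealed theory, for one fixed y.

    xs: uniform grid of student fields; P: values of P[x|y] on that grid.
    """
    dx = xs[1] - xs[0]
    xbar = sum(x * p for x, p in zip(xs, P)) * dx            # conditional mean
    dz = 2.0 * z_max / (nz - 1)
    total = 0.0
    for k in range(nz):
        z = -z_max + k * dz
        gauss_z = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # measure Dz
        # P[x|y] exp(-[B(x - xbar) - z]^2 / 2) on the x-grid
        w = [p * math.exp(-0.5 * (B * (x - xbar) - z) ** 2) for x, p in zip(xs, P)]
        den = sum(w) * dx
        num = sum(wi * (X - x) for wi, x in zip(w, xs)) * dx
        wX = math.exp(-0.5 * (B * (X - xbar) - z) ** 2)
        total += gauss_z * wX * num / (den * den) * dz
    return total / (Q * (1.0 - q))

# Hypothetical check: unit-variance Gaussian P[x|y], B = 1, Q = 1, q = 1/2.
xs = [-8.0 + 0.05 * i for i in range(321)]
P = [math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi) for x in xs]
value = phi_annealed(1.0, xs, P, B=1.0, Q=1.0, q=0.5)
print(value)
```

For a Gaussian P[x|y] all integrals can be done by hand, which gives a closed-form value against
which such a routine can be tested before it is applied to non-Gaussian distributions.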
Figure 4: Numerical simulations of on-line AdaTron learning, with $N = 10,000$, $\alpha = 1$ and $\eta$ = [?]. The
scatter plots show the observed student and teacher fields $(x,y) = (J\cdot\xi, B\cdot\xi)$ at times t = 5 (upper
left), t = 10 (upper right), t = 15 (lower left) and t = 20 (lower right), as measured during simulations
for the data in the training set D, drawn as points in the (x, y) plane. Note the development over
time of an increasingly narrow 'ridge' along the line x = 0.



5 Non-Hebbian Rules: Theory versus Simulations 

Henceforth we will always assume initial states with specified values for $R_0$ and $Q_0$ but without
correlations with the training set, i.e.

$$P_0[x|y] = \frac{e^{-\frac{1}{2}[x-R_0 y]^2/(Q_0-R_0^2)}}{\sqrt{2\pi(Q_0-R_0^2)}}$$

This implies that the student could initially have some knowledge of the rule to be learned, if we
wish, but will never know beforehand about the composition of the training set. We will inspect
the learning dynamics generated upon using two of the most common non-Hebbian (error-correcting)
learning rules:

$$\begin{array}{ll} \text{Perceptron}: & \mathcal{G}[x,y] = \mathrm{sgn}(y)\,\theta[-xy] \\ \text{AdaTron}: & \mathcal{G}[x,y] = |x|\,\mathrm{sgn}(y)\,\theta[-xy] \end{array} \tag{50}$$
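For readers who wish to experiment, the two rules in (50) translate directly into code. The Python
sketch below is illustrative only; the function names are ours, and the conventions chosen for the
step function and the sign function at zero argument are assumptions that are not fixed by (50).

```python
def theta(u):
    """Step function; we assume theta[0] = 0."""
    return 1.0 if u > 0 else 0.0

def sgn(y):
    """Sign function; we assume sgn(0) = +1."""
    return 1.0 if y >= 0 else -1.0

def perceptron_G(x, y):
    """Perceptron rule of (50): a unit correction in the teacher's direction upon an error."""
    return sgn(y) * theta(-x * y)

def adatron_G(x, y):
    """AdaTron rule of (50): a correction proportional to |x| upon an error."""
    return abs(x) * sgn(y) * theta(-x * y)

print(perceptron_G(-0.3, 2.0), adatron_G(-0.3, 2.0))   # student and teacher disagree
print(perceptron_G(0.3, 2.0), adatron_G(0.3, 2.0))     # student and teacher agree
```

Both functions vanish whenever student and teacher fields have the same sign, which is what makes
the rules error-correcting.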
Note that in the case of AdaTron learning the cases $\eta < 1$ and $\eta > 1$ give rise to qualitatively different
behaviour of the first term in the diffusion equation ([?]). For $\eta < 1$ the learning process, aiming at
the situation where $xy < 0$ never occurs, remedies inappropriate student fields by slowly moving them
towards (but not immediately across) the decision boundary. For $\eta > 1$ the adjustments made to
the student fields could move them well into the region at the other side of the decision boundary.
The case $\eta = 1$ is special, in that changes to the student fields tend to move them precisely onto the
decision boundary. The student field distribution consequently develops a $\delta$-peak at the origin, in
perfect agreement with what can be observed in numerical simulations (see e.g. the figures referring
to on-line AdaTron learning with $\eta = 1$ in [1]):

$$\eta = 1:\qquad \frac{d}{dt}P[x|y] = \frac{1}{\alpha}\Big\{\delta[x]\int\! dx'\,\theta[-x'y]\,P[x'|y] - P[x|y]\,\theta[-xy]\Big\} + \ldots$$

In fact the same occurs for all $\eta < 1$: about half of the probability weight of $P[x|y]$ will in due
course become concentrated in an increasingly thin ridge along the decision boundary x = 0. This is
illustrated in figure 4, for $\eta$ = [?]. Since such a singular behaviour (although in principle accurately
described by our equations) will be difficult to reproduce when solving the equations numerically, using
finite spatial resolution, we will in this paper only deal with the case of $\eta > 1$ for AdaTron learning.
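The three regimes follow from a one-line calculation. The first term of the diffusion equation maps a
student field x onto $x + \eta\,\mathcal{G}[x,y]$; for the AdaTron rule and a misclassified example ($xy<0$) one has
$\mathrm{sgn}(y) = -\mathrm{sgn}(x)$, so that the new field equals $(1-\eta)x$. The short Python sketch below (function name
and example values are ours, for illustration only) makes this explicit.

```python
def adatron_field_update(x, y, eta):
    """New student field x + eta * G[x, y] for the AdaTron rule of (50)."""
    if x * y < 0:                          # student and teacher disagree
        sign_y = 1.0 if y > 0 else -1.0
        return x + eta * abs(x) * sign_y   # equals (1 - eta) * x
    return x                               # no change when they agree

x, y = -2.0, 1.0                           # a hypothetical misclassified example
print(adatron_field_update(x, y, 0.5))     # moved towards the boundary x = 0
print(adatron_field_update(x, y, 1.0))     # moved exactly onto the boundary
print(adatron_field_update(x, y, 1.5))     # moved across the boundary
```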

5.1 Large $\alpha$ and Conditionally-Gaussian Approximations

Our first approximated theory (the large $\alpha$ approximation) is very simple, with neither saddle-point
equations to be solved nor nested integrations. As a result, numerical solution of the macroscopic
equations is straightforward and fast. In figures 5 (on-line Perceptron learning) and 6 (on-line AdaTron
learning) we compare the results of solving the coupled equations ([?]) numerically for finite
values of $\alpha$, plotting the generalisation- and training errors as functions of time, with results obtained
from performing numerical simulations. As could have been expected, the large $\alpha$ approximation
under-estimates the amount of disorder in the learning process, which immediately translates into
under-estimation of the gap between $E_t$ and $E_g$ (which is its fingerprint). It is also clear from these
figures that, although at any given time the quality of the predictions of this approximation does
improve when $\alpha$ increases (as indeed it should), and although there is surely qualitative agreement,
reliably accurate quantitative statements on the values of the training- and generalisation errors are
confined to the regime $\eta t < \alpha$. Yet, surprisingly, the agreement obtained is very good, even for
$\eta t > \alpha$. Apparently the present approximation does still capture the main characteristics of the
(non-Gaussian) joint field distribution. This is illustrated quite clearly and explicitly in figures 7
and 8, where we compare for a fixed time t = 10 the student and teacher fields as measured during
numerical simulations (for $N = 10,000$, drawn as dots in the (x, y) plane) for the $p = \alpha N$ questions
$\xi^\mu$ in the training set D, to the theoretical predictions for the joint field distribution $P[x, y]$ (drawn as
contour plots). We will not at this stage attempt to explain the surprising effectiveness of the large $\alpha$
approximation for small values of $\alpha$ (note that figures [?] and [?] even suggest an increase in accuracy
as $\alpha$ is lowered below $\alpha = 1$). This would require a systematic mathematical analysis of the non-linear
diffusion equation ([?]), which we consider to be beyond the scope of the present paper.
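The kind of numerical simulation against which the theory is compared can be sketched in a few lines
of Python. The sketch below is a small-scale illustration under our own assumptions, and not the code
used to produce the figures: we assume the microscopic on-line rule
$J \to J + (\eta/N)\,\xi\,\mathcal{G}[J\cdot\xi, B\cdot\xi]$ with binary questions $\xi\in\{-1,1\}^N$ drawn at random from a fixed
training set of $p=\alpha N$ examples, a normalised teacher vector B, a random initial student with
$Q_0 = 1$, time measured as the number of updates divided by N, and a system size (N = 200) far below
the N = 10,000 used in the paper.

```python
import math
import random

def simulate_online_perceptron(N=200, alpha=1.0, eta=1.0, t_max=10.0, seed=0):
    """On-line Perceptron learning with a restricted training set; returns (E_t, E_g)."""
    rng = random.Random(seed)
    p = int(alpha * N)
    B = [rng.gauss(0.0, 1.0) for _ in range(N)]
    norm_B = math.sqrt(sum(b * b for b in B))
    B = [b / norm_B for b in B]                       # teacher with |B| = 1
    J = [rng.gauss(0.0, 1.0) for _ in range(N)]
    norm_J = math.sqrt(sum(j * j for j in J))
    J = [j / norm_J for j in J]                       # student with Q_0 = 1
    # Fixed training set, recycled during learning.
    D = [[rng.choice((-1.0, 1.0)) for _ in range(N)] for _ in range(p)]
    teacher = [sum(b * s for b, s in zip(B, xi)) for xi in D]     # teacher fields y
    for _ in range(int(t_max * N)):
        mu = rng.randrange(p)
        xi, y = D[mu], teacher[mu]
        x = sum(j * s for j, s in zip(J, xi))                      # student field
        if x * y < 0:                                              # G = sgn(y) theta[-xy]
            step = (eta / N) * (1.0 if y > 0 else -1.0)
            J = [j + step * s for j, s in zip(J, xi)]
    Q = sum(j * j for j in J)
    R = sum(j * b for j, b in zip(J, B))
    fields = [(sum(j * s for j, s in zip(J, xi)), y) for xi, y in zip(D, teacher)]
    E_t = sum(1 for x, y in fields if x * y < 0) / p
    E_g = math.acos(max(-1.0, min(1.0, R / math.sqrt(Q)))) / math.pi
    return E_t, E_g

E_t, E_g = simulate_online_perceptron()
print(E_t, E_g)
```

At such small system sizes finite size effects are substantial, so the output should only be
expected to reproduce the qualitative gap between training and generalisation error.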

The conditionally-Gaussian approximation again involves no nested integrals, and its equations
can therefore still be solved numerically in a reasonably fast way, but it does already require the
solution (at each infinitesimal time step) of a scalar saddle-point equation to determine the spin-glass
order parameter q. Approximations of this type work extremely well for the simple Hebbian learning
rules, as we have seen earlier. However, numerical solution of the coupled equations (43,44,45) shows
quite clearly that for the more sophisticated non-Hebbian rules such as Perceptron and AdaTron,
which are of an error-correcting nature (i.e. where changes are made only when student and
teacher disagree), the conditionally-Gaussian approximation is less accurate than the previously
investigated large $\alpha$ approximation, in spite of the fact that the latter involved much simpler equations.
Apparently the generally non-Gaussian nature of the conditional distribution $P[x|y]$, and thereby of
the measure $M[x|y]$, is of crucial importance. It is not good enough to try getting away with allowing
the y-dependent averages $\overline{x}(y)$ and variances $\Delta^2(y)$ to be non-trivial functions. With
conditionally-Gaussian measures $M[x|y]$ it turns out that generating the right width of the conditional
distributions inevitably introduces tails for $P[x|y]$ which spill into the $xy < 0$ region, which are found to be
absent in error-correcting learning rules such as Perceptron and AdaTron. This picture is consistent
with figures 7 and 8, where we can observe that for any fixed value of the teacher field y the remaining
marginal distribution for x is generally not symmetric around its (y-dependent) average. We conclude
that the conditionally-Gaussian approximation is generally inferior to the large $\alpha$ approximation. We
will not waste paper by producing large numbers of graphs to illustrate this explicitly and
comprehensively, but we will rather draw the conditionally-Gaussian predictions together with those of the
other approximations and of the full theory, by way of illustration.

Figure 5: Comparison between the large $\alpha$ approximation of the theory and numerical simulations
of on-line Perceptron learning with $N = 10,000$ and $\eta = 1$. Markers: training errors $E_t$ (circles)
and generalisation errors $E_g$ (squares); finite size effects in the simulation data are of the order of the
marker size. Lines: theoretical predictions for training errors (solid) and generalisation errors (dashed)
as functions of time, according to the approximated theory. Training set sizes: $\alpha = 4$ (upper left),
$\alpha = 2$ (upper right), $\alpha = 1$ (lower left), and $\alpha = 0.5$ (lower right).

Figure 6: Comparison between the large $\alpha$ approximation of the theory and numerical simulations
of on-line AdaTron learning with $N = 10,000$ and $\eta = 2$. Markers: training errors $E_t$ (circles)
and generalisation errors $E_g$ (squares); finite size effects in the simulation data are of the order of the
marker size. Lines: theoretical predictions for training errors (solid) and generalisation errors (dashed)
as functions of time, according to the approximated theory. Training set sizes: $\alpha = 4$ (upper left),
$\alpha = 2$ (upper right), $\alpha = 1$ (lower left), and $\alpha = 0.5$ (lower right).

Figure 7: Comparison between the large $\alpha$ approximation of the theory and numerical simulations of
on-line Perceptron learning, with $N = 10,000$ and $\eta = 1$. Scatter plots (left): observed student and
teacher fields $(x, y) = (J\cdot\xi, B\cdot\xi)$ as measured at time t = 10 during simulations, for the data in D, drawn
in the (x, y) plane. Contour plots (right): corresponding predictions for the joint field distribution
$P[x, y]$, according to the approximated theory. Training set sizes: $\alpha = 0.5, 1, 2, 4$ (from top to bottom).

Figure 8: Comparison between the large $\alpha$ approximation of the theory and numerical simulations of
on-line AdaTron learning with $N = 10,000$ and $\eta = 2$. Scatter plots (left): observed student and teacher
fields $(x,y) = (J\cdot\xi, B\cdot\xi)$ as measured at time t = 10 during simulations, for the data in D, drawn
in the (x, y) plane. Contour plots (right): corresponding predictions for the joint field distribution
$P[x, y]$, according to the approximated theory. Training set sizes: $\alpha = 0.5, 1, 2, 4$ (from top to bottom).
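The origin of the spurious tails is easily quantified. If, for a given teacher field y, the student
field x were Gaussian with mean m and width w, then the probability weight in the error region
$xy<0$ would be $\frac{1}{2}\mathrm{erfc}[\mathrm{sgn}(y)\,m/(w\sqrt{2})]$, which is nonzero for any finite width. The Python sketch
below evaluates this expression; the function name and the example values of m and w are
hypothetical and serve only as an illustration.

```python
import math

def gaussian_error_mass(mean, width, y):
    """Probability that x*y < 0 when x is Gaussian with given mean and width, for fixed y != 0."""
    s = 1.0 if y > 0 else -1.0
    return 0.5 * math.erfc(s * mean / (width * math.sqrt(2.0)))

# For y > 0 and a conditional mean one width away from the decision boundary,
# a Gaussian shape still puts about 16% of the weight in the error region.
print(gaussian_error_mass(mean=1.0, width=1.0, y=1.0))
print(gaussian_error_mass(mean=2.0, width=1.0, y=1.0))
```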



5.2 Partially Annealed Approximation and Full Equations 

The partially annealed approximation and the full theory are both expected to improve upon the
large $\alpha$ approximation (note that the partially annealed approximation can be seen as an improved
version of the large $\alpha$ approximation, similar in structure but valid also for small $\alpha$, i.e. large B).
Although the partially annealed approximation does not involve a functional saddle-point equation to
be solved (which improves numerical speed), it shares with the full theory the appearance of nested
(Gaussian) integrals, namely those appearing in the function $\Phi[x,y]$ and in the saddle-point equation
for q. Thus, solution of both the full theory and of the partially annealed approximation involves a
significant amount of CPU time (avoiding standard instabilities of discretised diffusion equations sets
further limits on the maximum size of the time discretisation, dependent on the field resolution [?]),
which implies that we have to reduce our ambition and restrict the number of experiments to a few
typical ones.
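To illustrate how the field resolution limits the time step, consider only the diffusion term
$\frac{1}{2}\eta^2 Z\,\partial^2 P/\partial x^2$. The text does not state which discretisation scheme was used; if one assumes,
purely for illustration, an explicit forward-time centred-space scheme, the standard stability
condition for a diffusion constant $D = \frac{1}{2}\eta^2 Z$ reads $\Delta t \leq (\Delta x)^2/2D$. The Python sketch below evaluates
this bound; the function name and the values $\eta = 1$ and $Z = 1$ are hypothetical.

```python
def max_stable_time_step(dx, eta, Z):
    """Explicit-scheme bound dt <= dx^2 / (2 D), with D = eta^2 Z / 2 (assumed scheme)."""
    D = 0.5 * eta * eta * Z
    return dx * dx / (2.0 * D)

for dx in (0.05, 0.015):
    print(dx, max_stable_time_step(dx, eta=1.0, Z=1.0))
```

Under this assumption the admissible time step scales as the square of the field resolution, so
refining the grid by a factor of about three costs roughly a factor of ten in the number of time steps.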

We will thus investigate two examples, both with $\alpha = 1$: on-line Perceptron learning with $\eta$ = [?],
and on-line AdaTron learning with $\eta$ = [?]. We solve numerically the full equations of our theory,
i.e. the macroscopic dynamical laws ([?]) with the order parameters calculated at each time step
by solving ([?]), and show in figure 9 the training and generalisation errors as functions of time
together with the corresponding values as measured during numerical simulations, with systems of size
$N = 10,000$. In addition, we plot in the same picture, for comparison, the training- and generalisation
errors obtained by numerical solution of the three approximated theories as derived in the previous
section. In comparing curves we have to take into account that those describing the large $\alpha$
approximation were generated upon solving the diffusion equation with a significantly higher numerical field
resolution ($\Delta x = 0.015$) than the others (where we used $\Delta x = 0.05$), because of CPU limitations.
A restricted field resolution is likely to be more critical at large times, where the probability weight
in the $xy < 0$ region, responsible for the residual error and for the non-stationarity of the dynamics,
is highly concentrated close to the decision boundary x = 0. Especially for large times, we should
therefore expect the full theory, the conditionally-Gaussian approximation, and the partially annealed
approximation to all three perform better in reality than what is suggested by the numerical solutions
of their equations as shown in figure 9. This is particularly true for AdaTron learning, where even
for $\eta > 1$ (where we do not expect to observe a $\delta$-singularity) the field distributions still tend to
develop a jump discontinuity at x = 0. It turns out that the curves of the full theory and those of
the partially annealed approximation are very close (virtually on top of one another for the case of
Perceptron learning) in figure 9; apparently for the learning times considered here there is no real
need to evaluate the full theory.

Figure 9: Comparison between the full numerical solution of our equations, as well as the three
approximations of the theory, and the results of doing numerical simulations of on-line learning with
$N = 10,000$ and $\alpha = 1$. Markers: training errors $E_t$ (circles) and generalisation errors $E_g$ (squares);
finite size effects are of the order of the size of the markers. Lines: theoretical predictions for training
errors (lower) and generalisation errors (upper) as functions of time, according to the theory. The
different line types refer to: full equations (solid), annealed approximation (dashed),
conditionally-Gaussian approximation (dashed-dotted) and large $\alpha$ approximation (dotted) (note: the dashed and
solid curves fall virtually on top of one another). Left picture: Perceptron learning, with $\eta$ = [?]. Right
picture: AdaTron learning, with $\eta$ = [?].

Finally, we show in figure 10 for both the full theory and for the simulation experiments the two
distributions $P^\pm(x) = \int\! dy\; P[x, y]\,\theta[\pm y]$ for the student fields, given a specified sign of the teacher
field y (and thus a given teacher output), corresponding to the same experiments. Note that $P(x) =
P^+(x) + P^-(x)$. The pictures in figure 10 again illustrate quite clearly the difference between learning
with restricted training sets and learning with infinite training sets: in the former case the desired
agreement $xy > 0$ between student and teacher is achieved by a qualitative deformation of $P^\pm(x)$ away
from the initial Gaussian shape, rather than by adaptation of the first and second order moments.
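In a simulation the two distributions $P^\pm(x)$ are estimated by histogramming the student fields of
the training examples separately for positive and negative teacher fields. The Python sketch below
shows one way of doing this; the function name, the binning and the example data are our own
assumptions. The normalisation is chosen such that the two histograms together integrate to one
(provided all fields fall inside the binned interval), in line with $P(x) = P^+(x)+P^-(x)$.

```python
import math

def conditional_field_histograms(fields, x_min, x_max, bins):
    """Histogram estimates of P+(x) and P-(x) from a list of (x, y) field pairs."""
    width = (x_max - x_min) / bins
    plus = [0.0] * bins
    minus = [0.0] * bins
    n = len(fields)
    for x, y in fields:
        k = math.floor((x - x_min) / width)
        if 0 <= k < bins:                      # fields outside the interval are ignored
            target = plus if y > 0 else minus
            target[k] += 1.0 / (n * width)
    return plus, minus

# Hypothetical field pairs, for illustration only.
fields = [(0.5, 1.0), (-0.5, -1.0), (0.2, 1.0), (1.5, -2.0)]
plus, minus = conditional_field_histograms(fields, x_min=-2.0, x_max=2.0, bins=4)
print(plus, minus)
```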
Figure 10: Comparison between the full numerical solution of our equations and the results of doing
numerical simulations of on-line learning with $N = 10,000$ and $\alpha = 1$. Histograms: conditional student
field distributions $P^\pm(x) = \int\! dy\, P[x, y]\,\theta[\pm y]$ as measured at time t = 5. Smooth curves: corresponding
theoretical predictions. Upper pictures: Perceptron learning, with $\eta = 1$ (left: $P^-(x)$, right: $P^+(x)$).
Lower pictures: AdaTron learning, with $\eta$ = [?] (left: $P^-(x)$, right: $P^+(x)$).

Our restricted resolution numerics obviously have difficulty in reproducing the discontinuous
behaviour of $P^\pm(x)$ near x = 0 for on-line AdaTron learning (as expected), which explains why in this
regime the simplest large $\alpha$ approximation (which can be numerically evaluated with almost
arbitrarily high field resolution) appears to outperform the more sophisticated versions of the theory (which
CPU limitations force us to evaluate with rather limited field resolution), according to figure 9.

We conclude from the results in this section that our full theory indeed gives an adequate
description of the macroscopic process, and that the partially annealed approximation is almost equivalent
in performance to the full theory. As mentioned before, the conditionally-Gaussian approximation
performs generally poorly (except, as we have seen earlier, for the simple Hebbian rule). Which of the
remaining three versions of our theory to use in practice will clearly depend on the accuracy
constraints and available CPU time of the user, with the full theory at the higher end of the market (in
principle very accurate, but almost too CPU expensive to work out and exploit properly), with the
large $\alpha$ approximation on the lower end (reasonably accurate, but very cheap), and with the annealed
approximation as a sensible compromise in between these two.

6 Discussion 

Our aim in this sequel paper was to work out the general theory developed in [1] for several supervised
(on-line and batch, linear and non-linear) learning scenarios in single-layer perceptrons, to develop a
number of systematic approximations from the full set of equations, and to test the theory and its
approximations against both exactly solvable benchmarks and extensive numerical simulations. The
theory, built on the dynamical replica formalism [?], was designed to predict the evolution of training-
and generalisation errors, via a non-linear diffusion equation for the joint distribution of student and
teacher fields, in the regime where the size p of the (randomly composed) training set scales as $p = \alpha N$,
with $0 < \alpha < \infty$ and where N denotes the number of inputs. In this regime the input data will in due
course be recycled, as a result of which complicated correlations develop between the student weights
and the realisations of the data vectors (with their corresponding teacher answers) in the training set;
the student fields are no longer described by Gaussian distributions, training- and generalisation errors
will no longer be identical, and the more traditional and familiar statistical mechanical formalism as
developed for infinite training sets consequently breaks down.

We have first worked out our equations explicitly for the special case of Hebbian learning, where
the availability of exact results, derived directly from the microscopic equations, allows us to perform
a critical test of the theory. For batch Hebbian learning we can demonstrate explicitly that our theory
is fully exact. For on-line Hebbian learning, on the other hand, proving or disproving full exactness
requires solving a non-trivial functional saddle-point equation analytically, which we have not yet been
able to do. Nevertheless, we can prove that our theory is exact (i) with respect to its predictions for
Q, R and $E_g$, (ii) with respect to the first moments of the conditional field distributions $P[x|y]$ (for
any $y \in \mathbb{R}$), and (iii) in the stationary state. In order to also generate predictions for intermediate
times we have constructed an approximate solution of our equations, which is found to describe the
results of performing numerical simulations of on-line Hebbian learning essentially perfectly.

No exact benchmark solution is available for non-Hebbian (i.e. non-trivial) learning rules, leaving
numerical simulations as the only yardstick against which to test our theory. Motivated by the need
to solve a functional saddle-point equation at each time step in the full theory, and by the presence
of nested integrations, we have constructed a number of systematic approximations to the original
equations. We have compared the predictions of the full theory and of the three approximation
schemes with one another and with the results obtained upon performing numerical simulations of
non-linear learning rules, such as Perceptron and AdaTron, in large perceptrons (of size $N = 10,000$),
with various values of learning rates $\eta$ and relative training set sizes $\alpha$. One of the approximations,
a conditionally-Gaussian saddle-point approximation in the spirit of the particular approximation
that was found to work perfectly for Hebbian learning, turned out to perform badly for general
non-Hebbian rules. The other two approximations, the large $\alpha$ approximation and the partially annealed
approximation, each have their specific usefulness; the former is extremely simple and fast, whereas
the latter is overall more accurate, but more expensive in its CPU requirements (so that in practice its
true accuracy cannot always be realised). Yet, the large $\alpha$ approximation still works remarkably
well, even for small $\alpha$, in spite of it being so simple that it can be written as a fully explicit set of
equations for Q, R and the joint field distribution $P[x, y]$ only. For on-line learning these equations
can be simplified to



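$$\frac{d}{dt}Q = 2\eta\langle x\,\mathcal{G}[x,y]\rangle + \eta^2\langle\mathcal{G}^2[x,y]\rangle, \qquad \frac{d}{dt}R = \eta\langle y\,\mathcal{G}[x,y]\rangle$$

$$\frac{d}{dt}P[x|y] = \frac{1}{\alpha}\int\! dx'\,P[x'|y]\Big\{\delta[x-x'-\eta\mathcal{G}[x',y]]-\delta[x-x']\Big\} + \frac{1}{2}\eta^2\frac{\partial^2}{\partial x^2}P[x|y]\,\langle\mathcal{G}^2[x',y']\rangle$$
$$\qquad\qquad - \eta\frac{\partial}{\partial x}\Big\{P[x|y]\,\Big\langle\mathcal{G}[x',y']\Big[yy' + \frac{(x-\overline{x}(y))(\overline{x}(y')-Ry') + (x-Ry)(x'-\overline{x}(y'))}{Q-R^2}\Big]\Big\rangle\Big\}$$

with $\langle f[x,y]\rangle = \int\! Dy\,dx\; P[x|y]f[x, y]$ and $\overline{x}(y) = \int\! dx\; xP[x|y]$. The observed accuracy of these
simple equations in the small $\alpha$ regime suggests that for $\alpha \to 0$ the leading term in the diffusion
equation for $P[x|y]$ is the first term in the right-hand side, which reflects the direct effect of pattern
recycling, and which indeed has not been approximated.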
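The two scalar equations for Q and R are closed once the averages over the joint field distribution are
known. As a simple illustration, the Python sketch below performs a single explicit Euler step for
Q and R, with the averages $\langle x\mathcal{G}\rangle$, $\langle y\mathcal{G}\rangle$ and $\langle\mathcal{G}^2\rangle$ estimated from a list of sampled field pairs (x, y).
The function names, the use of sample averages in place of integrals over $P[x|y]$, the Euler
discretisation and the example values are all our own assumptions.

```python
def perceptron_G(x, y):
    """Perceptron rule of (50), with theta[0] = 0 assumed."""
    if x * y < 0:
        return 1.0 if y > 0 else -1.0
    return 0.0

def euler_step_QR(Q, R, fields, G, eta, dt):
    """One explicit Euler step of dQ/dt = 2 eta <xG> + eta^2 <G^2>, dR/dt = eta <yG>."""
    n = len(fields)
    V = sum(x * G(x, y) for x, y in fields) / n       # estimate of <x G[x,y]>
    W = sum(y * G(x, y) for x, y in fields) / n       # estimate of <y G[x,y]>
    Z = sum(G(x, y) ** 2 for x, y in fields) / n      # estimate of <G^2[x,y]>
    return Q + dt * (2.0 * eta * V + eta * eta * Z), R + dt * eta * W

# Hypothetical field pairs, for illustration only.
fields = [(-1.0, 2.0), (1.0, 1.0)]
Q_new, R_new = euler_step_QR(Q=1.0, R=0.0, fields=fields, G=perceptron_G, eta=1.0, dt=0.1)
print(Q_new, R_new)
```

In a full solution of the large $\alpha$ equations the field pairs would be replaced by the evolving
distribution $P[x|y]$ itself, which has to be propagated with the diffusion equation at the same time.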

For a discussion of further theoretical developments and refinements of the present dynamical
replica formalism we refer to [?]. We believe that our theory offers an efficient tool with which to
analyse and predict the outcome of learning processes in single-layer networks. In particular, for those
who are primarily interested in the progress and the outcome of learning processes there is no real
need to understand the full details of the derivation in [1]; as in this paper, one can simply adopt the
macroscopic laws of [1] (or one of the two appropriate approximations, to save CPU time) as a starting
point, and just apply them to the learning rules at hand. Generalization to multi-layer networks (with
a finite number of hidden nodes) is straightforward, although numerically intensive [9]. The case
of noisy teachers can also be studied with an appropriate extension of the present formalism [10],
involving a joint distribution of three rather than two fields (namely those of student, 'clean' teacher,
and 'noisy' teacher). In the example applications worked out so far in this paper (Hebbian learning,
Perceptron learning and AdaTron learning) our formalism has been found to be either exact or an
excellent approximation. It is not realistic to expect that simpler theories can be found with a similar
level of accuracy. While putting the finishing touch to this manuscript a preprint was communicated
[11] in which the authors apply the cavity method to the present problem. They manage to keep their
theory relatively simple by restricting themselves in several serious ways (to batch learning only, and to
gradient descent learning rules in order to use FDT relations) and by applying their theory only to a
linear learning rule. Here also the present theory would have been both simpler and exact. An exact
theory for both on-line and batch learning and for arbitrary learning rules can be constructed [12]
using a suitable adaptation of the path integral methods as in [13], with the obvious appeal of full
mathematical rigour at any time, but in describing transients it is going to be more complicated than
the present one as it will be built around macroscopic observables with two time arguments rather
than one (correlation- and response functions) and will take the form of an effective single weight
process with a dynamics with coloured stochastic noise and retarded self-interactions.